Prerequisite for XGBoost package

1.1 Load required libraries
PART ONE | TOTAL | 30

Domain: Telecom

Context:

A telecom company wants to use its historical customer data to predict customer behaviour and retain customers. The task is to analyse all relevant customer data and develop focused customer retention programs.

Data Description:

Each row represents a customer, and each column contains a customer attribute, as described in the column metadata. The data set includes information about:

● Customers who left within the last month – the column is called Churn

● Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies

● Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges

● Demographic info about customers – gender, age range, and if they have partners and dependents

Project Objective:

Build a model that identifies customers with a higher probability of churning. This helps the company understand the pain points and patterns behind customer churn and sharpen its focus on customer retention strategies.

Steps and Tasks:

1. Data Understanding and Exploration: 5

a. Read ‘TelcomCustomer-Churn_1.csv’ as a DataFrame and assign it to a variable. [1 Mark]

b. Read ‘TelcomCustomer-Churn_2.csv’ as a DataFrame and assign it to a variable. [1 Mark]

c. Merge both the DataFrames on key ‘customerID’ to form a single DataFrame. [2 Marks]

d. Verify that all the columns are incorporated in the merged DataFrame by using a simple comparison operator in Python. [1 Mark]

**Observed: the df_Churn1 DataFrame has 10 columns in total and the df_Churn2 DataFrame has 12.

After merging on customerID, the new DataFrame has 21 columns (customerID is the key column), hence the merge is successful.**
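The read/merge/verify steps above can be sketched as follows. This is a minimal illustration: the toy frames below stand in for the two CSVs (in the project, df_Churn1 and df_Churn2 would come from `pd.read_csv` on the actual files).

```python
import pandas as pd

# Toy stand-ins for TelcomCustomer-Churn_1.csv / TelcomCustomer-Churn_2.csv.
df_Churn1 = pd.DataFrame({"customerID": ["A1", "A2"], "gender": ["F", "M"]})
df_Churn2 = pd.DataFrame({"customerID": ["A1", "A2"], "Churn": ["No", "Yes"]})

# Merge both DataFrames on the shared key column.
df_Telecom = df_Churn1.merge(df_Churn2, on="customerID")

# Simple comparison operator to verify all columns carried over:
# merged columns = cols(df1) + cols(df2) - 1 shared key column.
all_present = df_Telecom.shape[1] == df_Churn1.shape[1] + df_Churn2.shape[1] - 1
print(all_present)  # True
```

With the real files (10 + 12 columns), the same comparison yields 21 merged columns.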

2. Data Cleaning and Analysis: 15

a. Impute missing/unexpected values in the DataFrame. [2 Marks]

**Observed: there are no null values in the DataFrame.**

**Observed: the TotalCharges column has 11 rows containing a space, which are imputed with the mode.**

**Observed: all columns in the df_Telecom DataFrame are of "object" dtype, except SeniorCitizen and tenure ("int64") and MonthlyCharges ("float64").**

Check each variable of type "object" for unexpected values and impute the best suitable value.

**Observed:

The columns below should hold only Yes/No values, so they are updated wherever required.

MultipleLines has 3 values ['No phone service', 'No', 'Yes'], so 'No phone service' is replaced with 'No'.

OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, and StreamingMovies have 3 values ['No', 'Yes', 'No internet service'], so 'No internet service' is replaced with 'No'.**
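A minimal pandas sketch of the imputation described above, on a toy frame (the column names match the report; the values are illustrative):

```python
import pandas as pd

# Toy frame with the kinds of unexpected values noted in the observations.
df = pd.DataFrame({
    "TotalCharges": ["29.85", " ", "108.15", " "],
    "MultipleLines": ["No phone service", "No", "Yes", "No"],
    "OnlineSecurity": ["No internet service", "Yes", "No", "Yes"],
})

# TotalCharges rows holding a blank space are imputed with the mode
# of the remaining (valid) values.
mode_value = df.loc[df["TotalCharges"] != " ", "TotalCharges"].mode()[0]
df["TotalCharges"] = df["TotalCharges"].replace(" ", mode_value)

# Collapse the three-level service columns to Yes/No.
df["MultipleLines"] = df["MultipleLines"].replace("No phone service", "No")
for col in ["OnlineSecurity"]:  # full data: also OnlineBackup, DeviceProtection, ...
    df[col] = df[col].replace("No internet service", "No")

print(sorted(df["MultipleLines"].unique()))  # ['No', 'Yes']
```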

b. Make sure all the variables with continuous values are of ‘Float’ type. [2 Marks]

[For Example: MonthlyCharges, TotalCharges]

Drop the customerID column, because the customerID does not influence whether a customer churns.

The column 'SeniorCitizen' is categorical by nature, with 1 meaning 'Yes' and 0 meaning 'No', so it should be converted to a categorical type.

Visualize a boxplot to check the distribution of the features.
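The type conversions for step 2b can be sketched as below, on toy data (TotalCharges is shown as strings because it is read in as object dtype):

```python
import pandas as pd

# Toy frame mirroring the dtypes noted in the observations.
df = pd.DataFrame({
    "MonthlyCharges": [29.85, 56.95],
    "TotalCharges": ["29.85", "1889.50"],  # object dtype after reading the CSV
    "SeniorCitizen": [0, 1],
})

# Continuous variables become float.
df["TotalCharges"] = df["TotalCharges"].astype(float)

# SeniorCitizen is categorical by nature: map 1 -> Yes, 0 -> No.
df["SeniorCitizen"] = df["SeniorCitizen"].map({1: "Yes", 0: "No"}).astype("category")

print(df.dtypes)
```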

c. Create a function that will accept a DataFrame as input and return pie-charts for all the appropriate Categorical features. Clearly show percentage distribution in the pie-chart. [4 Marks]
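One possible shape for this function is sketched below. The function name, the `max_levels` cut-off, and the returned column list are this sketch's own choices, not from the report; `autopct` supplies the percentage labels.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

def pie_charts(df, max_levels=6):
    """Plot a pie chart with percentage labels for each categorical column.

    Returns the list of columns plotted (a convenience for this sketch)."""
    plotted = []
    for col in df.select_dtypes(include=["object", "category"]).columns:
        counts = df[col].value_counts()
        if counts.size > max_levels:  # skip high-cardinality columns
            continue
        fig, ax = plt.subplots()
        ax.pie(counts, labels=counts.index, autopct="%1.1f%%")
        ax.set_title(col)
        plt.close(fig)  # a notebook would call plt.show() instead
        plotted.append(col)
    return plotted

demo = pd.DataFrame({"Churn": ["Yes", "No", "No", "No"], "tenure": [1, 2, 3, 4]})
print(pie_charts(demo))  # ['Churn']
```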

d. Share insights for Q2.b. [2 Marks]

TotalCharges is a continuous variable and is converted to float.

Churn Distribution with respect to below Features:

Gender: There is a negligible difference in the percentage/count of customers who changed service provider; both genders behaved similarly when it comes to migrating to another provider.

Contract: About 75% of customers with a month-to-month contract opted to move out, compared with 13% of customers on a one-year contract and 3% on a two-year contract.

Payment Method: Most customers who moved out used electronic check as their payment method. Customers who opted for credit-card automatic transfer, bank automatic transfer, or mailed check were less likely to move out.

Dependents: Customers without dependents are more likely to churn.

Partners: Customers who don't have partners are more likely to churn.

Senior Citizen: The fraction of senior citizens is very small, but most of them churn.

Online Security: Most customers churn in the absence of online security.

Paperless Billing: Customers with paperless billing are more likely to churn.

TechSupport: Customers with no tech support are more likely to migrate to another service provider.

Phone Service: A very small fraction of customers don't have phone service, and of those, about a third churn.

e. Encode all the appropriate Categorical features with the best suitable approach. [2 Marks]

There is a high positive correlation (82%) between TotalCharges and tenure.

MonthlyCharges is positively correlated with StreamingTV, TotalCharges, and InternetService_Fiber optic.

MonthlyCharges is highly negatively correlated with InternetService_No.
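A minimal sketch of the encoding step, assuming the common split between binary Yes/No columns (label-encoded) and multi-level columns (one-hot encoded, which produces columns like InternetService_Fiber optic mentioned above). The toy data is illustrative.

```python
import pandas as pd

# Toy frame: one binary column, one multi-level column.
df = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes"],
    "InternetService": ["DSL", "Fiber optic", "No"],
})

# Binary Yes/No columns map directly to 1/0.
df["Partner"] = df["Partner"].map({"Yes": 1, "No": 0})

# Multi-level columns are one-hot encoded.
df = pd.get_dummies(df, columns=["InternetService"])

print(list(df.columns))
# ['Partner', 'InternetService_DSL', 'InternetService_Fiber optic', 'InternetService_No']
```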

f. Split the data into 80% train and 20% test. [1 Mark]

g. Normalize/Standardize the data with the best suitable approach. [2 Marks]

All the variables are scaled to a range of 0 to 1 with normalization.

All categorical features are already in the range 0 to 1, so normalization does not affect their values.

Standardizing the categorical features would mean assigning a distribution to them, which we don't want to do.

3. Model building and Improvement: 10

a. Train a model using XGBoost and use RandomizedSearchCV to train on best parameters. Also print best performing parameters along with train and test performance. [5 Marks]

Model Score

Classification report for Training Set

Classification report for Test Set

From the classification report, the model has a weighted average performance of around 80% across precision, recall, f1-score, and support. Test recall is lower than training recall.

With the default parameters, XGBoost achieves 0.52 recall and 0.64 ROC-AUC on the test set, while recall on the training data is 0.85, which indicates the model is overfit.

XGBoost correctly predicted fewer than half of the churned customers.

SMOTE Algorithm

We can see the target is balanced after sampling: SMOTE has oversampled the minority class so that it equals the majority class, and both categories now have an equal number of records. More specifically, the minority class has been increased to the total count of the majority class.

Now look at the accuracy and recall results after applying the SMOTE algorithm (oversampling).

Retrain the same model on the balanced data.

Classification report for Training Set

Classification report for Test Set

From the classification report, we can see an increase in accuracy. The test recall score also increased from 54% to 85%, which is good.

Classification report for Training Set

Classification report for Test Set

From the classification report, accuracy, precision, recall, and f1-score for the minority class barely change after RandomizedSearchCV hyperparameter tuning.

Test precision, recall, and f1-score decreased slightly from 85% to 84%.

Training accuracy is 97%.

b. Train a model using XGBoost and use GridSearchCV to train on best parameters. Also print best performing parameters along with train and test performance. [5 Marks]

Classification report for Training Set

Classification report for Test Set

From the classification report, accuracy, precision, recall, and f1-score for the minority class barely change after GridSearchCV hyperparameter tuning.

Test precision, recall, and f1-score increased slightly from 85% to 86%.

Training accuracy remains unaffected at 97%.

Customer churn clearly hurts a firm's profitability, and various strategies can be implemented to reduce it. The best ways to avoid customer churn are improving customer service and building customer loyalty through relevant experiences and specialized service.


PART TWO | TOTAL | 30

DOMAIN: IT

• CONTEXT:

The purpose is to build a machine learning pipeline that works autonomously irrespective of the data, so that users can save the effort of building a pipeline for each dataset.

• PROJECT OBJECTIVE:

Build a machine learning pipeline that runs autonomously on a given CSV file and returns the best performing model.

• STEPS AND TASK [30 Marks]:

  1. Build a simple ML pipeline which will accept a single ‘.csv’ file as input and return a trained base model that can be used for predictions. You can use 1 Dataset from Part 1 (single/merged).
  2. Create separate functions for various purposes.
  3. Various base models should be trained to select the best performing model.
  4. Pickle file should be saved for the best performing model.

Include best coding practices in the code:

• Modularization

• Maintainability

• Well commented code etc.

1. Data Understanding:

Group similar functions

Calling function

Model Building

Base models

LogisticRegression

DecisionTreeClassifier

KNeighborsClassifier

RandomForestClassifier

XGBClassifier

AdaBoostClassifier

BaggingClassifier

Logistic Regression is the best of all the models, with the highest accuracy of 82%.

ML Pipeline

As per the pipeline above, Logistic Regression with GridSearchCV is the best classifier, with the highest accuracy score of 82%.
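The overall pipeline shape can be sketched as below. The function names are this sketch's own, `load_data` uses synthetic data where the real pipeline would call `pd.read_csv` on the input file, and only two of the seven base models are shown to keep the sketch short.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def load_data():
    """Load and split the data (synthetic here; pd.read_csv in the real pipeline)."""
    X, y = make_classification(n_samples=200, random_state=0)
    return train_test_split(X, y, test_size=0.2, random_state=42)

def train_base_models(X_train, y_train):
    """Fit each base model; the full pipeline also tries KNN, RandomForest,
    XGBoost, AdaBoost, and Bagging."""
    models = {
        "LogisticRegression": LogisticRegression(max_iter=1000),
        "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    }
    for model in models.values():
        model.fit(X_train, y_train)
    return models

def select_and_save_best(models, X_test, y_test, path="best_model.pkl"):
    """Pick the highest-scoring model and persist it as a pickle file."""
    best_name = max(models, key=lambda n: models[n].score(X_test, y_test))
    with open(path, "wb") as f:
        pickle.dump(models[best_name], f)
    return best_name

X_train, X_test, y_train, y_test = load_data()
models = train_base_models(X_train, y_train)
best = select_and_save_best(models, X_test, y_test)
print(best)
```

Keeping loading, training, and model selection in separate functions is what makes the pipeline reusable across datasets.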